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Abstract We consider models for multivariate point 
processes where the intensity is given non-parametrically 
in terms of functions in a reproducing kernel Hilbert 
space. The likelihood function involves a time integral 
and is consequently not given in terms of a finite num- 
ber of kernel evaluations. We derive a representation 
of the gradient of the log-likelihood and provide two 
methods for practically computing approximations to 
the gradient by time discretization. We illustrate the 
methods by an application to neuron network model- 
ing, and we investigate how the computational costs of 
the methods depend on the resolution of the time dis- 
cretization. The methods are implemented and avail- 
able in the R-package ppstat. 

Keywords Multivariate point processes • Penaliza- 
tion • Reproducing kernel Hilbert spaces 



1 Introduction 

A network of neurons is a prime example of an inter- 
acting dynamical system, and the characterization and 
modeling of the network activity is a central scientific 
challenge, see e.g. [11] . In this paper we consider mod- 
els and related computational methods appropriate for 
the modeling of the collection of spike times, which can 
be measured simultaneously for multiple neurons. The 
spike times are discrete event times and the appropriate 
modeling framework is that of multivariate counting or 
point process models. Our results are general, but their 
application is illustrated on neuron spike data. 
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With (JVJ) denoting the counting process of spike 
times for neuron i, % = 1, . . . ,p, a classical linear Hawkes 
model, [S] , is specified by letting the intensity of a spike 
for the i'th neuron be given as 



*l = E 
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2 ay(t-«)+/3y dN J 



(1) 



This intensity, or rate, specifies the conditional proba- 
bility of observing a spike immediately after time t in 
the sense that 
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where Tt denotes the history of all spikes preceeding 
time t, see e.g. [IU] or pQ. Note the upper integration 
limit, t—, which means that the integral w.r.t. N% only 
involves spikes strictly before t. This is an essential re- 
quirement for correct likelihood computations, see (13]) 
below. 

Neuron network activity is one example of a mul- 
tivariate interacting dynamical system that is driven 
by discrete events, but there are also other important 
examples. For instance the high-frequency trading of 
multiple financial assets [SJ. Indeed, the linear Hawkes 
model (fTl was also considered in Chapter 7 in [5J . Other 
examples are found in epidemiology, population biology 
and (bio) chemistry. Examples of chemical reaction net- 
works using Markovian multitype birth-death processes 
are found in [2] and [4] . For such models of chemical re- 
actions we introduce a vector of state variables consist- 
ing of the numbers of molecules of each type of reactant 
at a given time. The intensities of reaction events are 
then expressed in terms of the state variables and a col- 
lection of rate parameters. For the model given by (fTl) 
we can also introduce (X^, . . . , Xf) as a vector of state 
variables. This p-dimensional process is then a Markov 
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process and there is a one-to-one correspondance be- 
tween this process and the multivariate counting pro- 
cess (iVj 1 , . . . , Nf). In the special case a^ — the model 
reduces to a linear birth process. In applications we 
are, however, more interested in cases where ay < so 
that the state variables constitute cumulated but expo- 
nentially damped contributions of previous events. The 
Markovian birth-death models and the model given by 
(IT]) share two things; the state variables form a continu- 
ous time Markov process, and the models are specified 
by a finite dimensional parameter vector. The Markov 
property of the state variables is, among other things, 
beneficial from a computational viewpoint, and stan- 
dard techniques can be used for likelihood computa- 
tions and parameter estimation given that we observe 
the complete system. 

Our interest is to generalize the model given by (n]) 
to non-exponential integrands, and, in particular, to al- 
low those integrands to be estimated non-parametrically. 
A consequence is that the Markov property of the state 
variables will be lost. The integral fl]) can be under- 
stood as a linear filter of the multivariate counting pro- 
cess (N\, . . . , Nf), and we will consider the generaliza- 
tion of such linear filters to the case where 
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9ij (t- S )dNj 



(2) 



with gij general functions in a suitable function space. 
We will, moreover, allow for non-linear transformations 
of X\, such that the intensity is given by <p(Xl) for a 
general but fixed function ip. 

In this paper we are particularly concerned with 
efficient computation and minimization of the penal- 
ized negative log-likelihood as a function of the non- 
parametric components gij, with g^ in a reproducing 
kernel Hilbert space H. We consider algorithms for stan- 
dard quadratic penalization y"V H^ijH 2 , with || • || the 
Hilbert space norm on %. We will throughout assume 
that the ^-functions are variation independent, which 
imply that the computation and minimization of the 
joint penalized negative log-likelihood can be split into 
p separate minimization problems. To ease notation we 
will thus subsequently consider the modeling of one 
counting process N t in terms of N t }, . . . , Nf, where N t 
can be any of the p counting processes. 



2 Likelihood computations for point processes 
specified by linear filters 

We assume that we observe a simple counting process 
(N s ) < s <t of discrete events on the time interval [0, £]. 
The jump times of N are denoted T\ < ■ ■ ■ < rjv t • We 



let T-L denote a reproducing kernel Hilbert space of func- 
tions on [0, t] with reproducing kernel R : [0, t] x [0, t] — > 
K, and we let g = (gi, ■ ■ ■ ,g p ) € W. We assume that R 
is continuous in which case the functions in % are also 
continuous, see Theorem 17 in [3]. With N l , . . . ,N P 
counting processes having jump times a^ we introduce 



X s (g) = 



p 



i=i 



gi (s-u)dK 
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As a function of g we note that X s : W — > R being a 
sum of function evaluations is a continuous linear func- 
tional. The process X s (g) is called the linear predictor 
process. We consider the model of N where the inten- 
sity is given as X s (g) = (p{X s (g)) with ip : M —> [0, oo) 
a known function, and the objective is to estimate the 
(^-functions in %. 

From Corollary II. 7. 3 in [1] it follows that the nega- 
tive log-likelihood w.r.t. the homogeneous Poisson pro- 
cess is given as 



e(g) 



<p(X s (g))ds 
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(3) 



If <p is the identity the time integral has a closed form 
representation in terms of the antiderivatives of gi, but 
in general it has to be computed numerically. 

The following proposition gives the gradient of £ in 
the reproducing kernel Hilbert space. This result is cen- 
tral for our development and understanding of a practi- 
cally implementable minimization algorithm of the pe- 
nalized negative log-likelihood. 

Proposition 1 If tp is continuously differentiable the 
gradient in % w.r.t. gi is 

Vi%)=E I y'(X s {g))R{ S -a),-)ds 
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v(X Th (g)) 



(4) 



The proof of Proposition [T] is given in Section [6] 
It is a special case of Proposition 3.6 in [7] if H is a 
Sobolev space. However, since we restrict attention to 
counting process integrators in this paper, in contrast to 
[7] where more general integrator processes are allowed, 
we can give a relatively elementary proof for % being 
any reproducing kernel Hilbert space with a continuous 
kernel. 

Computations of I as well as the gradient involve the 
computation of X s (g) . Without further assumptions a 
direct computation of X s (g) on a grid of n time points 
involves in the order of n YTi=i ^t evaluations of the 
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(^-functions. In comparison, (nj) can be computed re- 
cursively with the order of np evaluations of the expo- 
nential function. 

In this paper we consider three techniques for re- 
ducing the general costs of computing X s (g). 

— Bounded memory. The filter functions gi are re- 
stricted to have support in [0, A] for a fixed A. 

— Preevaluations. The filter functions are preevalu- 
ated on a grid in [0, A]. 

— Basis expansions. The filter functions are of the 
form g — ^2 k PkBk for fixed basis functions Bk and 

X s {g) = Y,^X 8 {B k ). 

k 

The linear filters X s {B k ) are precomputed. 



3 Time discretization 

In this section we discuss the time discretizations nec- 
essary for the practical implementation of an optimiza- 
tion algorithm in T~L. We assume that all filter functions 
gi have a prespecified support restricted to [0, A], and 
that % is restricted to be a space of functions with sup- 
port in [0, A]. We approximate time integrals by left 
Riemann sums with functions evaluated in the grid 



= t < t\ < 



<tr. 



t 



and corresponding interdistances Ai = ti — ti-i for I = 
1, . . . , n. We will assume that the collection of jump 
times for N l for i = 1, . . . ,p is a subset of this grid and 
denote the corresponding subset of indices by /jump ^= 
{0,...,n}. 

We need an implement able representation of the lin- 
ear predictor as well as the functional gradient. A pos- 
sible representation of gi itself is via the TV-dimensional 
vector gi of its evaluations in a grid 

= 5 < Si < . . . < 5 N = A., 

that is, g ik = gi{5 k ) for k = 0, . . . , N - 1. We let g 
denote the N x p matrix with columns gi's for i = 
1, . . . ,p. Define 



huk = #{j I 4 < U 



a) < 4+lHO] < ti) 



as the number of jumps for N l in (t; — 6k+i,ti — 5k]- 
The indicator l(er* < t{) ensures that if ti = er' ; then 
huo — 0, which, in turn, ensures that the approxima- 
tion of the linear predictor below does not anticipate 
jumps. It is the intention that the grids are chosen such 
that the /life's take the values and 1 only. The linear 



predictor for given g^s evaluated in the grid points is 
approximated as 



6 ' = 2_^ hlikgik 



(5) 



■i . k 



£ 9(tl 

j:ti-A<a)<ti 



g(U - u)dJVJ. 



Strictly speaking, for correct handling of the lower limit 
in the integral, /im(JV-i) should be redefined to be 1 
if a 1 : = ti — A. Such a redefinition will typically have 
no detectable consequences, whereas handling the case 
a 1 , = ti correctly is crucial to avoid making the approx- 
imation anticipating. An approximation of the negative 
log-likelihood in g is then obtained as 



l Zeijump 



(6) 



If we use the same <5-grid for evaluating the kernel R 
we get the gradient approximation from Proposition [l] 

Vi r pprox ( g ) = ]T ]Y,</&)A l h lik ) R(Sk,-) 



v *e/ ju „ 



We observe that 

V^PP rox (g)espan{i?(«5 ,-),-- 



huk \R(Sk,-)- (7) 



,R(5 N ^-)}. 



The consequence is that any descent algorithm based 
on Vi£ approx (g) stays in the finite dimensional subspace 
spanned by R(5q, ■),..., R(6n-i, ■) _ if we start in this 
subspace. As we show below, there is a unique element 
in this subspace with evaluations gi, and the discretiza- 
tion effectively restricts gi to be a function in this sub- 
space. 



3.1 Method A - the direct approach 

The N x N Gram matrix G is given as Gjy = R(Sk, 5i). 
The vector g^ can be identified with the unique function 
9i = Y,k Pik R ( S k, •) obtained by solving 



This is the minimal norm element whose evaluations 
coincide with g^. Since the Gram matrix is positive defi- 
nite, G = UU T where the columns of U are orthogonal. 
This gives that 
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Fig. 1 Left: Data example consisting of three spike tracks from five independent multichannel measurements of turtle spinal 
neurons during a stimulation period. Middel: Estimates of the hi's in the model of v2.2 using method A to minimize QSp 
with n = 7752 and N = 120, Th e value of A = 1.2 x 10 -4 was chosen data adaptively. Right: Similar estimates of the h^s 
using method B to minimize ( 10 I with q = 50, with the basis functions being B-splines and with A = 9.5 x 10 -7 chosen data 



adaptively. The point-wise approximative 95% confidence intervals were obtained using a sandwich estimator of the asymptotic 
parameter variance. 



Note how the ffi- and thus the /^-parameter represen- 
tation of the evaluations Vi£ approx (g)(<5/ c ) can be read 
of directly from fcf\. We observe that the squared norm 
of gi equals 



5l || 2 = (/3°) T G/3?H|ft|| 



with || • ||2 denoting the ordinary Euclidean norm on 
R . The parametrization in terms of /3i is thus an isom- 
etry from ~R N into "H. The objective function - the pe- 
nalized negative log-likelihood approximation - can be 
computed as 



£ apprax {U/3) + \J2\\fr 



>II2 



(8) 



using (pi) and the /^-gradient can be computed as 

[/ T vf£ approx (C//3) + 2A/3 4 
where 

vfr pprox (g) fc = 5>'(&)A/*/ 
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Hik 



(9) 



Using ^ and ^ - and ^ - requires the computation 
of huk- This can either be done on-the-fly (a matrix free 
method) or by precomputing the n x (pr)-dimensional 
sparse matrix H = Qiuk)- 



3.2 Method B - using basis expansions 
Choose a set of basis functions B\, . . . , B q such that 
span{Bi,... ,B q } C span{i?(J , •)> ■ • -,R(Sn-i,')}- 



Precompute the 11x5 model matrices Z 1 of basis filters 

Zj j = J^hukB^k). 
k 

With gi — J2j0ijBji * ae "•-dimensional linear predic- 
tor is given as £ = £\ Z l /3f and f*Ppr°*(/3°) can be com _ 
puted using (km). The Gram matrix, G, is given by G&; = 
(Bf., B[).It is positive definite, thus with VGV T = I for 
V a matrix with orthogonal columns we reparametrize 
in terms of /% = V~ l (3f. In this parametrization 

\\ 9l \\i = (f3°fGtf = m\\i 

thus fti provides an isometric parametrization from R q 
into T-L. The objective function becomes 



£ approx (y#)+A^j 



H\\2 



(10) 



and the gradient is 

Xy««)^(ZftO T - J2 ^(ZfV) T + 2A(, 



iei jump 
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= v T iy««)4,(zjr- E 

= y T v O z approx (y/3)+2A ^ 
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is the gradient in the /3f parametrization. 
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Fig. 2 Top: Memory usage for storing the H-matrix for method A (•) and the Z-matrix for method B (•) for N ranging from 
40 to 400. Middel: Computation time for computing the negative log-likelihood. Bottom: Computation time for computing the 
gradient. 



4 Results 

We have implemented Method A and Method B and 
applied them to a test data set of neuron spike times. 
The data set consisted of multichannel measurements 
of spinal neurons from a turtle. The measurements were 
replicated 5 times and each time the spike activity was 
recorded over a period of 40 seconds. A 10 seconds stim- 
ulation was given within the observation window. We 
used the spike times for 3 neurons during the stimula- 
tion period, see Figure [l] 

The likelihood and gradient algorithms are imple- 
mented in the R-package ppstat, which supports opti- 
mization of the objective functions via the R-function 
optim using the BFGS-algorithm. The ppstat package 
offers a formula based model specification with an in- 
terface familiar from glm. Method A is implemented via 
the ppKernel function, with a typical call of the form 



ppKernel(v2.2 



) 



k(vl3.2) + k(v2.2) + k(v5.1), 
data = spikeData, 
family = HawkesC'logaf f ine") , 
support = 0.2, 
N = 120 



The data set contained in the object spikeData must 
be of class MarkedPointProcess from the supporting 
R-packagc processdata. Method B is implemented via 
the similar ppSmooth function. The choice of ip is speci- 
fied by the "inverse link function" - being "logaf f ine" 
in the call above. This function will be used throughout, 
and it is given as 

, . f e x for x < 

^ x) = \x + l forz>0. 

It maps R into (0, oo) and is continuously differentiable. 
The benefit of using this ip over the exponential function 
is that the exponential function often produces models 
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that are unstable or even explodes in finite time. We will 
not pursue the details. See [5] for details on stability. 

Figure [I] shows the estimated h^s obtained using 
either method A with the Sobolev kernel or method B 
with a B-spline basis. The estimates were computed by 
minimizing pi) and ( 10 1, respectively. The choice of the 
penalization parameter was made data adaptively by 
minimizing a TIC-criterion, see [6]. We will not pur- 
sue the details of the model selection procedure here, 
but focus on the efficiency of the computations of the 
likelihood and gradient. 

We investigated the memory usage and the com- 
putation times of both methods. The memory usage 
was obtained using the R-function object, size and 
the computation times were computed as the average 
of 40 replicated likelihood or gradient evaluations. The 
interest was on how they scale with the numbers n and 
N that determine the resolution of the time discretiza- 
tion. The number of basis functions for method B was 
kept fixed and equal to q = 50. Due to this fact N 
only affects the precomputation of the model matrices 
and not the likelihood and gradient computations for 
Method B. The implementation relies on precomputa- 
tion of the H or Z matrices, which are stored as sparse 
matrices as implemented in the R-package Matrix. 

Figure [2] shows that Method B used more memory 
for storing Z and that the memory usage as a function 
of n had a somewhat larger slope for Method B than for 
Method A. We should also note that the memory usage 
for Method A does not depend upon TV. Storing the 
matrices as non-sparse matrices the Z-matrix required 
61 MB and the H-matrix required 464 MB for n = 
50000 and N — 400. In comparison, the sparse versions 
required 8 MB and 1.5 MB, respectively. 

Figure [2] shows, furthermore, that the methods were 
comparable in terms of the time to compute the like- 
lihood - with Method A being slightly faster. There 
were again no detectable dependence on N for Method 
A. For the gradient computations, Figure [2] shows that 
Method A was slowest and scaled badly with N. 



sparse matrix. However, the gradient computations for 
Method A were clearly dominated by the iV 2 scaling. 

The storage requirements for Z can easily become 
prohibitively large. A choice of basis functions with lo- 
cal support in Method B, such as B-splines used here, 
can compensate partly for this. It is unlikely that it 
is useful to precompute Z l V, as this will destroy the 
benefits of a local basis. 

For Method B it is possible to precompute the model 
matrix in a sligthly different and more direct way. In- 
stead of precomputing the q x N basis function evalu- 
ations Bj(8k) we can compute ZJ. directly as 



J U 



k\t l —A<a i k <t i 



Bjfa-vi). 



This may be more accurate but since n 3> N in typical 
applications this comes at the cost of many more basis 
function evaluations. Whether this is critical in terms 
of the time to compute Z 1 depends upon how costly a 
single basis function evaluation is relative to the com- 
putation of the huk's. We have not presented data on 
the computational costs of the precomputations, but 
they were observed to be small compared to the costs 
of the actual optimization. 

6 Proof of Proposition [I] 

First note that since function evaluations are repre- 
sented in terms of the kernel by inner products we have 
that 

x s(g) = it E (#( s -4-): ft) 

i=1 j:<rj<s 

= E(E fl(*-4-)» 9i). (11) 

If ip is a continuous differentiable function we find 
that 
iP(X s (g + eh)) - iP(X s (g)) 



5 Discussion 

The two methods, Method A and Method B, considered 
in this paper differ in terms of what is precomputcd. 
Computing the matrix H upfront (or computing it on- 
thc-fly) as in Method A should require only a fraction of 
the memory for storing the Z-matrices as most entries 
are 0. This was confirmed by our implementation. We 
also confirmed that neither the storage requirements 
nor the likelihood computations for Method A depend 
on the number N of <5-grid points when H is stored as a 



iP(X s (g) + eX 3 (h)) - tP(X s ( 9 )) 



il>'(X s (g))X a {h) 



for e — > 0. This is clearly a continuous linear functional. 



Using (11) and differentiating only w.r.t. the i'th coor- 



dinate of g we find that the corresponding gradient in 
His 

j:er*.<s 

Taking ip — log ip this yields the gradient of the second 
term in the negative log- likelihood, X)fc=i 1°S fi^-n, (<?)), 
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directly. For the first term we take ip = tp, but we need 
to ensure that we can interchange the order of integra- 
tion and differentiation. To this end the following norm 
bound on V i(p(X s {g)) is useful 



\VMX s (g))\\ < \<p'(X a (g))\ 



\\R{s-o) 



< C t N l t sup y/R{s,s) < oo. 
se[o,t] 

Here C t = sup se r 0]t i \<p'(X s (g))\ is finite because X s (g) 
is continuous in s and Lp' is assumed continuous. We 



have also used that \\R(s — c!-, 



R(s 



and the fact that R is continuous to conclude that the 
bound is finite. The bound shows that 

J2 I v'(X 8 (g))R( S -a*,-)ds 

is an element in rl and the required interchange of in- 
tegration and differentiation is justified by the bound. 
This completes the proof. □ 
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